Abstract: We consider the problem of efficiently constructing
polar codes over binary memoryless symmetric (BMS) channels.
The complexity of designing polar codes via an exact evaluation
of the polarized channels to find which ones are "good" appears
to be exponential in the block length. In [3], Tal and Vardy show
that if instead the evaluation is performed approximately, the
construction has only linear complexity. In this paper, we follow
this approach and present a framework where the algorithms of
[3] and new related algorithms can be analyzed for complexity
and accuracy. We provide numerical and analytical results
on the efficiency of such algorithms; in particular, we show
that one can find all the "good" channels (except a vanishing
fraction) with almost linear complexity in the block-length (up to a
polylogarithmic factor).

I. Introduction 

A. Polar Codes 

Polar coding, introduced by Arıkan in [1], is an encoding/decoding
scheme that provably achieves the capacity of
the class of BMS channels. Let $W$ be a BMS channel. Given
the rate $R < I(W)$, polar coding is based on choosing
a set of $2^n R$ rows of the matrix $G_n = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}^{\otimes n}$ to form
a $2^n R \times 2^n$ matrix which is used as the generator matrix
in the encoding procedure.$^1$ The way this set is chosen
depends on the channel $W$ and uses a phenomenon called
channel polarization: Consider an infinite binary tree and place
the underlying channel $W$ on the root node and continue
recursively as follows. Having the channel $P : \{0, 1\} \to \mathcal{Y}$
on a node of the tree, define the channels $P^- : \{0, 1\} \to \mathcal{Y}^2$
and $P^+ : \{0, 1\} \to \{0, 1\} \times \mathcal{Y}^2$ by

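$$P^-(y_1, y_2 \mid x_1) = \sum_{x_2 \in \{0,1\}} \frac{1}{2} P(y_1 \mid x_1 \oplus x_2) P(y_2 \mid x_2) \tag{1}$$

$$P^+(y_1, y_2, x_1 \mid x_2) = \frac{1}{2} P(y_1 \mid x_1 \oplus x_2) P(y_2 \mid x_2), \tag{2}$$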

and place $P^-$ and $P^+$ as the left and right children of this
node. As a result, at level $n$ there are $N = 2^n$ channels,
which we denote from left to right by $W_N^{(1)}$ to $W_N^{(N)}$. In
[1], Arıkan proved that as $n \to \infty$, a fraction approaching
$I(W)$ of the channels at level $n$ have capacity close to 1
(call them "noiseless" channels) and a fraction approaching
$1 - I(W)$ have capacity close to 0 (call them "completely
noisy" channels). Given the rate $R$, the indices of the matrix
$G_n$ are chosen as follows: choose a subset of the channels
$\{W_N^{(i)}\}_{1 \le i \le N}$ with the most mutual information and choose
the rows of $G_n$ with the same indices as these channels. For
example, if the channel $W_N^{(j)}$ is chosen, then the $j$-th row
of $G_n$ is selected, up to the bit-reversal permutation. In the
following, given $n$, we call the set of indices of the $NR$ channels
with the most mutual information the set of good indices.

$^1$ There are extensions of polar codes given in [2] which use different kinds
of matrices.

We can equivalently say that as $n \to \infty$ the fraction of
channels with Bhattacharyya constant near 0 approaches $I(W)$
and the fraction of channels with Bhattacharyya constant near
1 approaches $1 - I(W)$. The Bhattacharyya constant of a
channel $P : \{0, 1\} \to \mathcal{Y}$ is given by

$$Z(P) = \sum_{y \in \mathcal{Y}} \sqrt{P(y \mid 0) P(y \mid 1)}. \tag{3}$$

Therefore, we can alternatively call the set of indices of the $NR$
channels with the least Bhattacharyya parameters the set of good
indices. It is also interesting to mention that the sum of the
Bhattacharyya parameters of the chosen channels is an upper
bound on the block error probability of polar codes when we
use the successive cancellation decoder.
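As a concrete illustration of the transforms (1) and (2) and of the Bhattacharyya constant (3), the following minimal Python sketch represents a channel as a dictionary that maps each output symbol $y$ to the pair $(P(y \mid 0), P(y \mid 1))$. This representation and the names `bhattacharyya`, `polar_minus`, `polar_plus` and `bec` are assumptions made for this example only; they do not come from [1] or [3].

```python
import itertools
import math

def bhattacharyya(P):
    """Z(P) = sum over y of sqrt(P(y|0) * P(y|1)), as in (3)."""
    return sum(math.sqrt(p0 * p1) for p0, p1 in P.values())

def polar_minus(P):
    """Channel P^- of (1): input x1, output (y1, y2), x2 averaged out."""
    out = {}
    for (y1, a), (y2, b) in itertools.product(P.items(), repeat=2):
        q0 = 0.5 * (a[0] * b[0] + a[1] * b[1])  # x1 = 0, sum over x2 in {0, 1}
        q1 = 0.5 * (a[1] * b[0] + a[0] * b[1])  # x1 = 1, sum over x2 in {0, 1}
        out[(y1, y2)] = (q0, q1)
    return out

def polar_plus(P):
    """Channel P^+ of (2): input x2, output (y1, y2, x1)."""
    out = {}
    for (y1, a), (y2, b) in itertools.product(P.items(), repeat=2):
        for x1 in (0, 1):
            # x2 = 0 gives P(y1|x1) P(y2|0); x2 = 1 gives P(y1|x1 xor 1) P(y2|1)
            out[(y1, y2, x1)] = (0.5 * a[x1] * b[0], 0.5 * a[1 - x1] * b[1])
    return out

def bec(eps):
    """Hypothetical helper: binary erasure channel with erasure probability eps."""
    return {"0": (1 - eps, 0.0), "1": (0.0, 1 - eps), "e": (eps, eps)}
```

For an erasure channel with erasure probability 0.5, this sketch gives $Z(P^-) = 0.75$ and $Z(P^+) = 0.25$, while the output alphabet grows from 3 symbols to 9 and 18 symbols after a single step.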

B. Problem Formulation 

Designing a polar code is equivalent to finding the set of
good indices. The main difficulty in this task is that, since
the output alphabet of $W_N^{(i)}$ is $\mathcal{Y}^N \times \{0, 1\}^{i-1}$, the cardinality
of the output alphabet of the channels at level $n$ of the
binary tree is doubly exponential in $n$, that is, exponential in the
block-length. So computing the exact transition probabilities
of these channels seems to be intractable and hence we need
some efficient methods to "approximate" these channels.
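To make the growth concrete, the short sketch below computes the base-2 logarithm of the output alphabet size $|\mathcal{Y}|^N \cdot 2^{i-1}$ of the $i$-th channel at level $n$. The function name and the restriction to alphabets whose size is a power of two are assumptions made for illustration.

```python
def log2_output_alphabet_size(log2_y, n, i):
    """log2 of |Y|^N * 2^(i-1) for the i-th channel at level n.

    log2_y: log2 of the output alphabet size |Y| of the underlying channel
            (assumed to be a power of two to keep the arithmetic exact).
    n:      tree level, so the block-length is N = 2^n.
    i:      channel index, 1 <= i <= N.
    """
    N = 2 ** n
    if not 1 <= i <= N:
        raise ValueError("index i must satisfy 1 <= i <= N")
    return log2_y * N + (i - 1)
```

For a channel with binary output and $n = 10$, the last channel at that level already has $2^{2047}$ output symbols, which is why exact evaluation is not an option.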

In [1], it is suggested to use a Monte-Carlo method for
estimating the Bhattacharyya parameters. Another method in
this regard is quantization [3], [4], [5], [6, Appendix B]:
approximating the given channel with a channel that has fewer
output symbols. More precisely, given a number $k$, the task
is to come up with efficient methods to replace channels that
have more than $k$ outputs with "close" channels that have at
most $k$ outputs. A few comments in this regard are the following:

• The term "close" above depends on the definition of the
quantization error which can be different depending on
the context. In our problem, in its most general setting
we can define the quantization error as the difference
between the true set of good indices and the approximate
set of good indices. However, it seems that analyzing
this type of error may be difficult and in the sequel we
consider types of errors that are easier to analyze.

• Thus, as a compromise, we will intuitively think of two
channels as being close if they are close with respect
to some given metric; typically mutual information but
sometimes probability of error. Moreover, we require that
this closeness is in the right direction: the approximated
channel must be a "pessimistic" version of the true
channel. Thus, the approximated set of good channels
will be a subset of the true set.

• Intuitively, we expect that as k increases the overall error 
due to quantization decreases; the main art in designing 
the quantization methods is to have a small error while 
using relatively small values of k. However, for any 
quantization algorithm an important property is that as 
k grows large, the approximate set of good indices using 
the quantization algorithm with k fixed approaches the 
true set of good indices. We give a precise mathematical 
definition in the sequel. 

Taking the above-mentioned factors into account, a suitable
formulation of the quantization problem is to find procedures
to replace each channel $P$ at each level of the binary tree
with another symmetric channel $\tilde{P}$ with the number of output
symbols limited to $k$ such that, firstly, the set of good indices
obtained with this procedure is a subset of the true good
indices obtained from the channel polarization, i.e., channel $\tilde{P}$
is polar degraded with respect to $P$, and secondly, the ratio
of these good indices is maximized. More precisely, we start
from channel $W$ at the root node of the binary tree, quantize it
to $\tilde{W}$ and obtain $\tilde{W}^-$ and $\tilde{W}^+$ according to (1) and (2). Then,
we quantize the two new channels and continue the procedure
to complete the tree. To state things mathematically, let $Q_k$
be a quantization procedure that assigns to each channel $P$ a
BMS channel $\tilde{P}$ such that the output alphabet of
$\tilde{P}$ is limited to a constant $k$. We call $Q_k$ admissible if for any
$i$ and $n$

$$I(\tilde{W}_N^{(i)}) \le I(W_N^{(i)}). \tag{4}$$

One can alternatively call $Q_k$ admissible if for any $i$ and $n$

$$Z(\tilde{W}_N^{(i)}) \ge Z(W_N^{(i)}). \tag{5}$$

Note that (4) and (5) are essentially equivalent as $N$ grows
large. Given an admissible procedure $Q_k$ and a BMS channel
$W$, let $\rho(Q_k, W)$ be$^2$

$$\rho(Q_k, W) = \lim_{n \to \infty} \frac{\left|\{i : I(\tilde{W}_N^{(i)}) > \frac{1}{2}\}\right|}{N}. \tag{6}$$

So the quantization problem is the following: given a number $k \in \mathbb{N}$
and a channel $W$, how can we find admissible procedures $Q_k$
such that $\rho(Q_k, W)$ is maximized and is close to the capacity
of $W$? Can we reach the capacity of $W$ as $k$ goes to infinity?
Are such schemes universal in the sense that they work well
for all the BMS channels? It is worth mentioning that if we
first let $k$ tend to infinity and then $n$ to infinity then the limit is
indeed the capacity, but we are addressing a different question
here, namely we first let $n$ tend to infinity and then $k$ (or
perhaps couple $k$ to $n$). In Section IV we indeed prove that
such schemes exist.

$^2$ Instead of $\frac{1}{2}$ in (6) we can use any number in $(0, 1)$.

II. Algorithms for Quantization 

A. Preliminaries 

Any discrete BMS channel can be represented as a collection
of binary symmetric channels (BSCs). The binary input is
given to one of these BSCs at random such that the $i$-th BSC
is chosen with probability $p_i$. The output of this BSC together
with its crossover probability $x_i$ is considered as the output
of the channel. Therefore, a discrete BMS channel $W$ can be
completely described by a random variable $\chi \in [0, 1/2]$. The
pdf of $\chi$ will be of the form

$$P_\chi(x) = \sum_{i=1}^{m} p_i \delta(x - x_i) \tag{7}$$

such that $\sum_{i=1}^{m} p_i = 1$ and $0 \le x_i \le 1/2$. Note that $Z(W)$
and $1 - I(W)$ are expectations of the functions $f(x) = 2\sqrt{x(1 - x)}$
and $g(x) = -x \log(x) - (1 - x) \log(1 - x)$ over
the distribution $P_\chi$, respectively.
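The two expectations above are easy to evaluate once a channel is stored as a list of $(p_i, x_i)$ pairs. The sketch below does exactly that; the list-of-pairs representation, the function names, and the choice of base-2 logarithms (so that $I(W)$ is measured in bits) are assumptions of this example.

```python
import math

def binary_entropy(x):
    """g(x) = -x log2(x) - (1 - x) log2(1 - x), with g(0) = g(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def bhattacharyya_fn(x):
    """f(x) = 2 sqrt(x (1 - x)), the Bhattacharyya constant of a BSC(x)."""
    return 2.0 * math.sqrt(x * (1 - x))

def expectation(masses, fn):
    """Expectation of fn under the mass distribution [(p_1, x_1), ...]."""
    return sum(p * fn(x) for p, x in masses)

def bms_bhattacharyya(masses):
    """Z(W) as the expectation of f."""
    return expectation(masses, bhattacharyya_fn)

def bms_mutual_information(masses):
    """I(W) = 1 - E[g], in bits."""
    return 1.0 - expectation(masses, binary_entropy)
```

For instance, an erasure channel with erasure probability 0.3 corresponds to the masses `[(0.7, 0.0), (0.3, 0.5)]`, for which the sketch returns $Z(W) = 0.3$ and $I(W) = 0.7$.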

Therefore, in the quantization problem we want to replace
the mass distribution $P_\chi$ with another mass distribution $P_{\tilde{\chi}}$
such that the number of output symbols of $\tilde{\chi}$ is at most $k$, and
the channel $\tilde{W}$ is polar degraded with respect to $W$. We know
that the following two operations imply polar degradation:

• Stochastically degrading the channel.

• Replacing the channel with a BEC with the same
Bhattacharyya parameter.

Furthermore, note that the stochastic dominance of the random
variable $\tilde{\chi}$ with respect to $\chi$ implies that $\tilde{W}$ is stochastically
degraded with respect to $W$. (But the reverse is not true.)

In the following, we propose different algorithms based on
different methods of polar degradation of the channel. The first
algorithm is a naive algorithm called the mass transportation
algorithm, based on the stochastic dominance of the random
variable $\tilde{\chi}$, and the second one, which outperforms the first,
is called the greedy mass merging algorithm. For both of the
algorithms the quantized channel is stochastically degraded
with respect to the original one.

B. Greedy Mass Transportation Algorithm 

In the most general form of this algorithm we basically look
at the problem as a mass transport problem. In fact, we have
non-negative masses $p_i$ at locations $x_i$, $i = 1, \dots, m$, $x_1 < \dots < x_m$.
What is required is to move the masses, by only
moves to the right, to concentrate them on $k < m$ locations,
and try to minimize $\sum_i p_i d_i$ where $d_i = x_{i+1} - x_i$ is the
amount the $i$-th mass has moved. Later, we will show that this
method is not optimal but useful in the theoretical analysis of
the algorithms that follow.



Algorithm 1 Mass Transportation Algorithm

1: Start from the list $(p_1, x_1), \dots, (p_m, x_m)$.
2: Repeat $m - k$ times
3: Find $j = \arg\min\{p_i d_i : i \ne m\}$
4: Add $p_j$ to $p_{j+1}$ (i.e. move $p_j$ to $x_{j+1}$)
5: Delete $(p_j, x_j)$ from the list.



Note that Algorithm 1 is based on the stochastic dominance
of the random variable $\tilde{\chi}$ with respect to $\chi$. Furthermore, in
general, we can let $d_i = f(x_{i+1}) - f(x_i)$, for an arbitrary
increasing function $f$.
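The steps of Algorithm 1 can be sketched in a few lines of Python. This is an illustrative sketch rather than a reference implementation: the function names, the list-of-pairs input format, and the plain linear scan for the minimum are assumptions made here.

```python
import math

def binary_entropy(x):
    """Binary entropy in bits; one example of an increasing f on [0, 1/2]."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def mass_transportation(masses, k, f=lambda x: x):
    """Greedy sketch of Algorithm 1.

    masses: list of (p, x) pairs with x in [0, 1/2].
    k:      target number of masses (k >= 1).
    f:      increasing function defining d_i = f(x_{i+1}) - f(x_i).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    pts = sorted(masses, key=lambda m: m[1])
    p = [m[0] for m in pts]
    x = [m[1] for m in pts]
    while len(p) > k:
        # Cheapest move to the right; the last mass has no right neighbour.
        j = min(range(len(p) - 1),
                key=lambda i: p[i] * (f(x[i + 1]) - f(x[i])))
        p[j + 1] += p[j]  # move p_j onto x_{j+1}
        del p[j], x[j]
    return list(zip(p, x))
```

Because masses only move to the right, the expectation of any increasing `f` can only grow, which is the stochastic dominance property mentioned above.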

C. Mass Merging Algorithm 

The second algorithm merges the masses. Two masses $p_1$
and $p_2$ at positions $x_1$ and $x_2$ would be merged into one mass
$p_1 + p_2$ at position $\bar{x}_1 = \frac{p_1}{p_1 + p_2} x_1 + \frac{p_2}{p_1 + p_2} x_2$. This algorithm
is based on the stochastic degradation of the channel, but the
random variable $\chi$ is not stochastically dominated by $\tilde{\chi}$. The
greedy algorithm for the merging of the masses would be the
following:

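Algorithm 2 Merging Masses Algorithm

1: Start from the list $(p_1, x_1), \dots, (p_m, x_m)$.
2: Repeat $m - k$ times
3: Find $j = \arg\min\{p_i (f(\bar{x}_i) - f(x_i)) - p_{i+1} (f(x_{i+1}) - f(\bar{x}_i)) : i \ne m\}$, where $\bar{x}_i = \frac{p_i}{p_i + p_{i+1}} x_i + \frac{p_{i+1}}{p_i + p_{i+1}} x_{i+1}$
4: Replace the two masses $(p_j, x_j)$ and $(p_{j+1}, x_{j+1})$ with a
single mass $(p_j + p_{j+1}, \bar{x}_j)$.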



Note that in practice, the function $f$ can be any increasing
concave function, for example, the entropy function or the
Bhattacharyya function. In fact, since the algorithm is greedy
and suboptimal, it is hard to investigate explicitly how changing
the function $f$ will affect the total error of the algorithm
in the end (i.e., how far $\tilde{W}$ is from $W$).
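A minimal Python sketch of the merging procedure is given below. The names `merge_cost` and `mass_merging`, the list-of-pairs format, and the assumption that all masses are strictly positive are choices made for this illustration only.

```python
import math

def binary_entropy(x):
    """Binary entropy in bits; an increasing concave f on [0, 1/2]."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def merge_cost(m1, m2, f):
    """Increase of E[f] when two neighbouring masses are merged.

    Returns (cost, merged position); the merged position is the
    mass-weighted average of the two positions.
    """
    (p1, x1), (p2, x2) = m1, m2
    xb = x1 + p2 / (p1 + p2) * (x2 - x1)  # assumes p1 + p2 > 0
    cost = p1 * (f(xb) - f(x1)) - p2 * (f(x2) - f(xb))
    return cost, xb

def mass_merging(masses, k, f=binary_entropy):
    """Greedy sketch of Algorithm 2: merge the cheapest neighbours until k remain."""
    if k < 1:
        raise ValueError("k must be at least 1")
    pts = sorted(masses, key=lambda m: m[1])
    while len(pts) > k:
        j = min(range(len(pts) - 1),
                key=lambda i: merge_cost(pts[i], pts[i + 1], f)[0])
        _, xb = merge_cost(pts[j], pts[j + 1], f)
        pts[j:j + 2] = [(pts[j][0] + pts[j + 1][0], xb)]
    return pts
```

Each merge keeps the total mass and the mean of the positions unchanged, while for a concave `f` the expectation of `f` cannot decrease, which matches the degradation property stated above.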

III. Bounds on the Approximation Loss 

In this section, we provide some bounds on the maximum
approximation loss we have in the algorithms. We define
the "approximation loss" to be the difference between the
expectation of the function $f$ under the true distribution $P_\chi$
and the approximated distribution $P_{\tilde{\chi}}$. Note that the kind of
error that is analyzed in this section is different from what was
defined in Section I-B. The connection of the approximation
loss with the quantization error is made clear in Theorem 1.
For convenience, we will simply stick to the word "error"
instead of "approximation loss" from now on.

We first find an upper bound on the error made in Algorithms
1 and 2 and then use it to provide bounds on the error
made while performing operations (1) and (2).

Lemma 1. The maximum error made by Algorithms 1 and 2
is upper bounded by $O(\frac{1}{k})$.



Proof: First, we derive an upper bound on the error of
Algorithms 1 and 2 in each iteration, and therefore a bound
on the error of the whole process. Let us consider Algorithm
1. The problem can be reduced to the following optimization
problem:

$$e = \max_{p_i, x_i} \min_i (p_i d_i) \tag{8}$$

such that

$$\sum_i p_i = 1, \qquad \sum_i d_i \le 1, \tag{9}$$

where $d_i = f(x_{i+1}) - f(x_i)$, and $f(\frac{1}{2}) - f(0) = 1$ is assumed
w.l.o.g. We prove the lemma by the Cauchy-Schwarz inequality.

$$\min_i p_i d_i = \left(\min_i \sqrt{p_i d_i}\right)^2 \tag{10}$$

Now by applying Cauchy-Schwarz we have

$$\sum_{i=1}^{m} \sqrt{p_i d_i} \le \left(\sum_{i=1}^{m} p_i\right)^{1/2} \left(\sum_{i=1}^{m} d_i\right)^{1/2} \le 1. \tag{11}$$

Since the sum of the $m$ terms $\sqrt{p_i d_i}$ is less than 1, the minimum
of the terms will certainly be less than $\frac{1}{m}$. Therefore,

$$e = \left(\min_i \sqrt{p_i d_i}\right)^2 \le \frac{1}{m^2}. \tag{12}$$

For Algorithm 2, achieving the same bound as Algorithm 1
is trivial. Denote by $e_i^{(1)}$ the error made in Algorithm 1 and by $e_i^{(2)}$
the error made in Algorithm 2. Then,

$$e_i^{(2)} = p_i (f(\bar{x}_i) - f(x_i)) - p_{i+1} (f(x_{i+1}) - f(\bar{x}_i)) \tag{13}$$
$$\le p_i (f(\bar{x}_i) - f(x_i)) \tag{14}$$
$$\le p_i (f(x_{i+1}) - f(x_i)) = e_i^{(1)}. \tag{15}$$

Consequently, the error generated by running the whole
algorithm can be upper bounded by $\sum_{i=k+1}^{m} \frac{1}{i^2}$, which is $O(\frac{1}{k})$.

■

What is stated in Lemma 1 is a loose upper bound on the
error of Algorithm 2. To achieve better bounds, we upper
bound the error made in each iteration of Algorithm 2
as follows:

$$e_i = p_i (f(\bar{x}_i) - f(x_i)) - p_{i+1} (f(x_{i+1}) - f(\bar{x}_i)) \tag{16}$$
$$\le p_i \frac{p_{i+1}}{p_i + p_{i+1}} \Delta x_i f'(x_i) - p_{i+1} \frac{p_i}{p_i + p_{i+1}} \Delta x_i f'(x_{i+1}) \tag{17}$$
$$= \frac{p_i p_{i+1}}{p_i + p_{i+1}} \Delta x_i (f'(x_i) - f'(x_{i+1})) \tag{18}$$
$$\le \frac{p_i + p_{i+1}}{4} \Delta x_i^2 |f''(c_i)|, \tag{19}$$

where $\Delta x_i = x_{i+1} - x_i$ and (17) is due to the concavity of the
function $f$. Furthermore, (19) is by the mean value theorem,
where $x_i \le c_i \le x_{i+1}$.

If $|f''(x)|$ is bounded for $x \in (0, 1)$, then we can prove that
$\min_i e_i \sim O(\frac{1}{m^3})$ similarly to Lemma 1. Therefore the error
of the whole algorithm would be $O(\frac{1}{k^2})$. Unfortunately, this
is not the case for either the entropy function or the Bhattacharyya
function. However, we can still achieve a better upper bound
for the error of Algorithm 2.

Lemma 2. The maximum error made by Algorithm 2 for the
entropy function $h(x)$ can be upper bounded by the order of
$O\left(\frac{\log(k)}{k^{1.5}}\right)$.

Proof: See Appendix. ■

We can see that the error is improved by a factor of $\frac{\log k}{\sqrt{k}}$
in comparison with Algorithm 1.

Now we use the result of Lemma 1 to provide bounds on
the total error made in estimating the mutual information of a
channel after $n$ levels of operations (1) and (2).

Theorem 1. Assume $W$ is a BMS channel and using Algorithm
1 or 2 we quantize the channel $W$ to a channel $\tilde{W}$.
Taking $k = n^2$ is sufficient to give an approximation error
that decays to zero.

Proof: First notice that for any two BMS channels $W$ and
$V$, doing the polarization operations (1) and (2), the following
is true:

$$(I(W^-) - I(V^-)) + (I(W^+) - I(V^+)) = 2 (I(W) - I(V)). \tag{20}$$

Replacing $V$ with $\tilde{W}$ in (20), and using the result of Lemma
1, we conclude that after $n$ levels of polarization the sum of
the errors in approximating the mutual information of the $2^n$
channels is upper-bounded by $O(\frac{n 2^n}{k})$. In particular, taking
$k = n^2$, one can say that the "average" approximation error
of the $2^n$ channels at level $n$ is upper-bounded by $O(\frac{1}{n})$.
Therefore, at least a fraction $1 - \frac{1}{\sqrt{n}}$ of the channels are
distorted by at most $\frac{1}{\sqrt{n}}$, i.e., except for a negligible fraction of
the channels the error in approximating the mutual information
decays to zero.

■ 
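The accounting behind this proof can be spelled out as follows. This is a sketch under the assumption that every quantization step loses at most $\varepsilon = O(1/k)$ in mutual information (Lemma 1); the symbols $S_j$ and $\varepsilon$ are introduced here for illustration, with $S_j$ denoting the total loss in mutual information over the $2^j$ channels at level $j$. By (20) the loss inherited from a parent is doubled when passing to its two children, and quantizing the $2^j$ new channels adds at most $\varepsilon$ each.

```latex
\begin{align*}
S_0 &\le \varepsilon, \qquad S_j \le 2 S_{j-1} + 2^j \varepsilon \\
\frac{S_n}{2^n} &\le \frac{S_{n-1}}{2^{n-1}} + \varepsilon \le \dots \le (n+1)\,\varepsilon
  = O\!\left(\frac{n}{k}\right) \\
k = n^2 &\;\Longrightarrow\; \frac{S_n}{2^n} = O\!\left(\frac{1}{n}\right) \\
\frac{\left|\{ i : \text{loss}_i > 1/\sqrt{n} \}\right|}{2^n}
  &\le \frac{S_n / 2^n}{1/\sqrt{n}} = O\!\left(\frac{1}{\sqrt{n}}\right)
\end{align*}
```

The last line is Markov's inequality applied to the non-negative per-channel losses, which is what yields the "all but a vanishing fraction" statement.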

As a result, since the overall complexity of the encoder construction
is $O(k^2 N)$, this leads to "almost linear" algorithms
for encoder construction with arbitrary accuracy in identifying
good channels.
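The overall construction, alternating the polarization steps (1), (2) with a degrading quantization, can be sketched as follows. The sketch works directly on the BSC-mixture representation of Section II-A and relies on the standard fact that two BSCs with crossover probabilities $a$ and $b$ combine under (1) into a BSC with crossover $a(1-b) + b(1-a)$, and under (2) into two BSCs selected by whether the two observations of the input agree. The function names, the use of the Bhattacharyya function as $f$, and the plain $O(m^2)$ greedy loop are assumptions of this example; it is not tuned to reach the $O(k^2 N)$ complexity quoted above.

```python
import math

def z_fn(x):
    """Bhattacharyya constant of a BSC with crossover probability x."""
    return 2.0 * math.sqrt(x * (1.0 - x))

def merge_to_k(masses, k, f=z_fn):
    """Greedy merging in the spirit of Algorithm 2 (degrading quantization)."""
    pts = sorted(masses, key=lambda m: m[1])
    while len(pts) > k:
        best = None
        for i in range(len(pts) - 1):
            (p1, x1), (p2, x2) = pts[i], pts[i + 1]
            xb = x1 + p2 / (p1 + p2) * (x2 - x1)
            cost = p1 * (f(xb) - f(x1)) - p2 * (f(x2) - f(xb))
            if best is None or cost < best[0]:
                best = (cost, i, (p1 + p2, xb))
        pts[best[1]:best[1] + 2] = [best[2]]
    return pts

def minus(masses):
    """BSC-mixture form of (1)."""
    return [(pa * pb, a * (1 - b) + b * (1 - a))
            for pa, a in masses for pb, b in masses]

def plus(masses):
    """BSC-mixture form of (2): each pair of BSCs yields up to two BSCs."""
    out = []
    for pa, a in masses:
        for pb, b in masses:
            agree = a * b + (1 - a) * (1 - b)   # the two observations agree
            differ = a * (1 - b) + (1 - a) * b  # the two observations differ
            if agree > 0.0:
                out.append((pa * pb * agree, a * b / agree))
            if differ > 0.0:
                out.append((pa * pb * differ,
                            min(a * (1 - b), b * (1 - a)) / differ))
    return out

def construct(masses, n, k):
    """Quantized Bhattacharyya values of the 2^n channels at level n, left to right."""
    level = [merge_to_k(masses, k)]
    for _ in range(n):
        level = [merge_to_k(t(ch), k) for ch in level for t in (minus, plus)]
    return [sum(p * z_fn(x) for p, x in ch) for ch in level]
```

A call such as `construct([(1.0, 0.11)], n, 16)` then returns pessimistic Bhattacharyya values for a BSC with crossover probability 0.11, from which a set of good indices can be selected.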

IV. Exchange of Limits 

In this section, we show that there are admissible schemes
such that as $k \to \infty$, the limit in (6) approaches $I(W)$ for
any BMS channel $W$. We use the definition stated in (5) for
the admissibility of the quantization procedure.

Theorem 2. Given a BMS channel $W$ and for large enough
$k$, there exist admissible quantization schemes $Q_k$ such that
$\rho(Q_k, W)$ is arbitrarily close to $I(W)$.

Proof: Consider the following algorithm: The algorithm
starts with a quantized version of $W$ and it does the normal
channel splitting transformation followed by quantization
according to Algorithm 1 or 2, but once a sub-channel is
sufficiently good, in the sense that its Bhattacharyya parameter
is less than an appropriately chosen parameter $\delta$, the algorithm
replaces the sub-channel with a binary erasure channel which
is degraded (polar degradation) with respect to it. (As the
operations (1) and (2) over an erasure channel also yield
an erasure channel, no further quantization is needed for the
children of this sub-channel.)

Since the ratio of the total good indices of BEC($Z(P)$) is
$1 - Z(P)$, the total error that we make by replacing $P$ with
BEC($Z(P)$) is at most $Z(P)$, which in the above algorithm is
less than the parameter $\delta$.

Now, for a fixed level $n$, according to Theorem 1, if we
make $k$ large enough, the ratio of the quantized sub-channels
whose Bhattacharyya value is less than $\delta$ approaches
its original value (with no quantization), and for these sub-channels,
as explained above, the total error made with the
algorithm is $\delta$. Now from the polarization theorem and by
sending $\delta$ to zero we deduce that as $k \to \infty$ the ratio of
good indices approaches the capacity of the original channel.

■ 
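The reason no further quantization is needed below an erasure channel is that its descendants are described by a single number. The sketch below uses the well-known recursion for erasure channels, $Z^- = 2Z - Z^2$ and $Z^+ = Z^2$; the function names and the threshold argument are assumptions made for illustration.

```python
def bec_polarize(z, levels):
    """Bhattacharyya parameters of the 2^levels descendants of BEC(z), left to right."""
    zs = [z]
    for _ in range(levels):
        # Each erasure channel splits into BEC(2v - v^2) and BEC(v^2).
        zs = [w for v in zs for w in (2 * v - v * v, v * v)]
    return zs

def good_fraction(zs, threshold):
    """Fraction of channels whose Bhattacharyya parameter is below threshold."""
    return sum(1 for v in zs if v < threshold) / len(zs)
```

The average of the returned values stays equal to `z` at every level, and as the number of levels grows the fraction of nearly noiseless descendants tends to $1 - z$, which is the "ratio of the total good indices of BEC($Z(P)$)" used in the proof.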

V. Simulation Results 

In order to evaluate the performance of our quantization
algorithm, similarly to [3], we compare the performance of
the degraded quantized channel with the performance of an
upgraded quantized channel. An algorithm similar to Algorithm
2 for upgrading a channel is the following. Consider
three neighboring masses in positions $(x_{i-1}, x_i, x_{i+1})$ with
probabilities $(p_{i-1}, p_i, p_{i+1})$. Let $t = \frac{x_i - x_{i-1}}{x_{i+1} - x_{i-1}}$. Then, we
split the middle mass at $x_i$ to the other two masses such that
the final probabilities will be $(p_{i-1} + (1 - t) p_i, p_{i+1} + t p_i)$ at
positions $(x_{i-1}, x_{i+1})$. The greedy channel upgrading procedure
is described in Algorithm 3.



Algorithm 3 Splitting Masses Algorithm

1: Start from the list $(p_1, x_1), \dots, (p_m, x_m)$.
2: Repeat $m - k$ times
3: Find $j = \arg\min\{p_i (f(x_i) - t f(x_{i+1}) - (1 - t) f(x_{i-1})) : i \ne 1, m\}$
4: Add $(1 - t) p_j$ to $p_{j-1}$ and $t p_j$ to $p_{j+1}$.
5: Delete $(p_j, x_j)$ from the list.



The same upper bounds on the error of this algorithm
can be provided similarly to Section III with a little bit of
modification.
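The splitting procedure can be sketched in the same style as the degrading algorithms. The function name, the list-of-pairs format, and the requirement that the positions be distinct (so that $t$ is well defined) are assumptions of this example.

```python
def mass_splitting(masses, k, f):
    """Greedy sketch of Algorithm 3 (upgrading quantization).

    masses: list of (p, x) pairs with pairwise distinct positions x.
    k:      target number of masses (k >= 2, the two end points are kept).
    f:      concave function used to measure the loss of each split.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    pts = sorted(masses, key=lambda m: m[1])
    p = [m[0] for m in pts]
    x = [m[1] for m in pts]

    def t_of(i):
        # Split ratio that keeps the mean position unchanged.
        return (x[i] - x[i - 1]) / (x[i + 1] - x[i - 1])

    def cost(i):
        t = t_of(i)
        return p[i] * (f(x[i]) - t * f(x[i + 1]) - (1 - t) * f(x[i - 1]))

    while len(p) > k:
        j = min(range(1, len(p) - 1), key=cost)  # interior masses only
        t = t_of(j)
        p[j - 1] += (1 - t) * p[j]
        p[j + 1] += t * p[j]
        del p[j], x[j]
    return list(zip(p, x))
```

In contrast to merging, each split can only decrease the expectation of a concave `f`, so the resulting channel is an optimistic rather than a pessimistic version of the original one.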

In the simulations, we measure the maximum achievable
rate while keeping the probability of error less than $10^{-3}$
by finding the maximum possible number of channels with the
smallest Bhattacharyya parameters such that the sum of their
Bhattacharyya parameters is upper bounded by $10^{-3}$. The
channel is a binary symmetric channel with capacity 0.5. Using
Algorithms 2 and 3 for degrading and upgrading the channels
with the Bhattacharyya function $f(x) = 2\sqrt{x(1 - x)}$, we
obtain the following results:

It is worth restating that the algorithm runs in complexity
$O(k^2 N)$. Table I shows the achievable rates for Algorithms
2 and 3 when the block-length is fixed to $N = 2^{15}$ and $k$
changes in the range of 2 to 64.

k         2       4       8       16      32      64
degrade   0.2895  0.3667  0.3774  0.3795  0.3799  0.3800
upgrade   0.4590  0.3943  0.3836  0.3808  0.3802  0.3801

TABLE I: Achievable rate with error probability at most $10^{-3}$ vs.
maximum number of output symbols $k$ for block-length $N = 2^{15}$
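The rate reported in the simulations follows from a simple selection rule: sort the Bhattacharyya parameters and keep adding the smallest ones while their sum stays below the target. A minimal sketch is given below; the function name and the default threshold argument are assumptions made for illustration.

```python
def achievable_rate(z_values, max_error=1e-3):
    """Largest fraction of channels whose Bhattacharyya parameters sum to at most max_error.

    z_values: Bhattacharyya parameters of the N channels at level n.
    """
    if not z_values:
        raise ValueError("z_values must not be empty")
    total, count = 0.0, 0
    for z in sorted(z_values):
        if total + z > max_error:
            break
        total += z
        count += 1
    return count / len(z_values)
```

Feeding in parameters obtained from a degrading quantization gives a rate that is guaranteed to be achievable, while parameters from an upgrading quantization give an upper estimate; the gap between the two is what the tables report.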

It can be seen from Table I that the difference of achievable
rates within the upgraded and degraded version of the scheme
is as small as $10^{-4}$ for $k = 64$. We expect that for a fixed $k$,
as the block-length increases the difference will also increase
(see Table II).

n         5       8       11      14      17      20
degrade   0.1250  0.2109  0.2969  0.3620  0.4085  0.4403
upgrade   0.1250  0.2109  0.2974  0.3633  0.4102  0.4423

TABLE II: Achievable rate with error probability at most $10^{-3}$ vs.
block-length $N = 2^n$ for $k = 16$


However, in our scheme this difference will remain small
even as $N$ grows arbitrarily large as predicted by Theorem 2
(see Table III).

n         21      22      23      24      25
degrade   0.4484  0.4555  0.4616  0.4669  0.4715
upgrade   0.4504  0.4575  0.4636  0.4689  0.4735

TABLE III: Achievable rate with error probability at most $10^{-3}$ vs.
block-length $N = 2^n$ for $k = 16$



We see that the difference between the rate achievable in the
degraded channel and the upgraded channel stays constant at $2 \times 10^{-3}$
even after 25 levels of polarization for $k = 16$.

Appendix 

A. Proof of Lemma 2

Proof: Let us first find an upper bound for the second
derivative of the entropy function. Suppose that $h(x) = -x \log(x) - (1 - x) \log(1 - x)$.
Then, for $0 < x \le \frac{1}{2}$, we
have

$$|h''(x)| = \frac{1}{x (1 - x) \ln(2)} \le \frac{2}{x \ln(2)}. \tag{21}$$

Using (21) the minimum error can further be upper bounded
by

$$\min_i e_i \le \min_i (p_i + p_{i+1}) \Delta x_i^2 \frac{1}{x_i \ln(4)}. \tag{22}$$

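Now suppose that we have $l$ mass points with $x_i \le \frac{1}{\sqrt{m}}$ and
$m - l$ mass points with $x_i \ge \frac{1}{\sqrt{m}}$. For the first $l$ mass points
we use the upper bound obtained for Algorithm 1. Hence, for
$1 \le i \le l$ we have

$$\min_i e_i \le \min_i p_i \Delta h(x_i) \tag{23}$$
$$\sim O\left(\frac{\log(m)}{l^2 \sqrt{m}}\right), \tag{24}$$

where (23) is due to (15) and (24) can be derived again by
applying the Cauchy-Schwarz inequality. Note that this time

$$\sum_{i=1}^{l-1} \Delta h(x_i) \le h\left(\frac{1}{\sqrt{m}}\right) \sim O\left(\frac{\log(m)}{\sqrt{m}}\right).$$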

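For the $m - l$ mass points one can write

$$\min_i e_i \le \min_i (p_i + p_{i+1}) \Delta x_i^2 \frac{1}{x_i \ln(4)} \tag{25}$$
$$\le \min_i (p_i + p_{i+1}) \Delta x_i^2 \frac{\sqrt{m}}{\ln(4)} \tag{26}$$
$$= \frac{\sqrt{m}}{\ln(4)} \min_i (p_i + p_{i+1}) \Delta x_i^2 \tag{27}$$
$$\sim O\left(\frac{\sqrt{m}}{(m - l)^3}\right), \tag{28}$$

where (28) is due to Hölder's inequality as follows:

Let $q_i = p_i + p_{i+1}$. Therefore, $\sum_i (p_i + p_{i+1}) \le 2$ and
$\sum_i \Delta x_i \le 1/2$.

$$\min_i q_i \Delta x_i^2 = \left(\min_i (q_i \Delta x_i^2)^{1/3}\right)^3 \tag{29}$$

Now by applying Hölder's inequality we have

$$\sum_i (q_i \Delta x_i^2)^{1/3} \le \left(\sum_i q_i\right)^{1/3} \left(\sum_i \Delta x_i\right)^{2/3} \le 2^{1/3} \left(\frac{1}{2}\right)^{2/3} < 1. \tag{30}$$

Therefore,

$$\min_i e_i \le \sqrt{m} \left(\min_i (q_i \Delta x_i^2)^{1/3}\right)^3 \sim O\left(\frac{\sqrt{m}}{(m - l)^3}\right). \tag{31}$$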

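Overall, the error made in the first step of the algorithm
would be

$$\min_i e_i \sim \min\left\{ O\left(\frac{\log(m)}{l^2 \sqrt{m}}\right), O\left(\frac{\sqrt{m}}{(m - l)^3}\right) \right\} \tag{32}$$
$$\sim O\left(\frac{\log(m)}{m^{2.5}}\right). \tag{33}$$

Thus, the error generated by running the whole algorithm can
be upper bounded by $\sum_{i=k+1}^{m} \frac{\log(i)}{i^{2.5}} \sim O\left(\frac{\log(k)}{k^{1.5}}\right)$.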